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Abstract 



^ c*| In this paper, we develop an approach to recursively estimate the quadratic risk for matrix recovery 

problems regularized with spectral functions. Toward this end, in the spirit of the SURE theory, a key 
step is to compute the (weak) derivative and divergence of a solution with respect to the observations. 
As such a solution is not available in closed form, but rather through a proximal splitting algorithm, 
we propose to recursively compute the divergence from the sequence of iterates. A second challenge 
that we unlocked is the computation of the (weak) derivative of the proximity operator of a spectral 
function. To show the potential applicability of our approach, we exemplify it on a matrix completion 
qq ■ problem to objectively and automatically select the regularization parameter. 



I/"") \ 1. Introduction 

o 

^NJ ■ Consider the problem of estimating a matrix Xo G ^™i xn 2 f rom p noisy observations y = A{Xq) + w G M. F , 
where w ~ A/"(0, er 2 Idp). The linear bounded operator A : R™ lX ™ 2 — >• M. p entails loss of information such that 
the problem is ill-posed. This problem arises in various research fields. Because of ill-posedness, side information 
through a regularizing term is necessary. We thus consider the problem 

X 

Si X(y) G Argmin \\\y - A(X)\\ 2 + \J{X) (1) 

where the set of minimizers is assumed non-empty, A > is a regularization parameter and J : R™ 1 x ™ 2 — > KU{oo} 
is a proper lower semi-continuous (lsc) convex regularizing function that imposes the desired structure on X(y). 
In this paper, we focus on the case where J is a convex spectral function, that is a symmetric convex function 
of the singular values of its argument. Spectral regularization can account for prior knowledge on the spectrum 
of Xo, typically low-rank (see e.g. Fazel, 2002). 

In practice, the choice of the regularization parameter A in (1) remains an important problem largely unexplored. 
Typically, we want to select A minimizing the quadratic risk E u) |A(j/) — A"o| 2 . Since Xo is unknown and X(y) is 
non-unique, one can instead consider an unbiased estimate of the prediction risk E w |^4(X(y)) — >4(Xo)| 2 , where 
it can be easily shown that fi(y) = A(X(y)) is a single- valued mapping. With the proviso that [i(y) is weakly 
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diffcrcntiable, the SURE (for Stein unbiased risk estimator, Stein, 1981) 

SURE(y) = \\y-f,(y))\\ 2 - Pa 2 +2a 2 div fx(y) (2) 

is an unbiased estimate of the prediction risk, where div fx(y) = Tr (9/z(j/)), and d[i(y) stands for the (weak) 
Jacobian of fi(y). The SURE depends solely on y, without prior knowledge of Xq and then can prove very useful 
as a basis for automatic ways to choose the regularization parameters A. 

Contributions. Our main contribution is to provide the derivative of matrix- valued spectral functions where 
the matrices have distinct singular values which extends the result of Lewis & Sendov (2001) to non-symmetric 
square matrices. This result is used to recursively compute the derivative of any solution of spectrally regularized 
inverse problems by solving (1). This is achieved by computing the derivatives of the iterates provided by a 
proximal splitting algorithm. In particular, this provides an estimate of div^(y) in (2) which allows to compute 
SURE(y). A Numerical example on a matrix completion problem is given to support our findings. 

2. Recursive risk estimation 

Proximal splitting Proximal splitting algorithms have become extremely popular to solve non-smooth convex 
optimization problems that arise often in inverse problems, e.g. (1). These algorithms provide a sequence 
of iterates X^ l \y) that provably converges to a solution X(y). A practical way to compute div/x(y), hence 
SURE(y), as initiated by Vonesch et al. (2008), and that we pursue here, consists in differentiating this sequence 
of iterates. This methodology has been extended to a wide class of proximal splitting schemes in (Deledalle et al., 
2012). For the sake of clarity, and without loss of generality, we focus on the case of the forward-backward (FB) 
splitting algorithm (Combettcs & Wajs, 2005). 

The FB scheme is a good candidate to solve (1) if J is simple, meaning that its proximity operator has a 
closed- form. Recall that the proximity operator of a lsc proper convex function G on W llXn2 is 

Prox G (X) = argmin \\X - Z\\% + G(Z). 

The FB algorithm iteration reads 

A^ +1 > =Prox TA . / (X« + rA*(y - A(X&))) (3) 

where A* denotes the adjoint operator of A, t > is chosen such that t||.A*.4|| < 2, the dependency of the 
iterate X" to y is dropped to lighten the notation. 

Risk estimation The divergence term div fi(y) is obtained by differentiating formula (3), which allows, for 
any vector S £ R p to compute iteratively = dX^ l \y)[8\ (the derivative of j/ 1 — ^ X^\y) at y in the direction 
5) as 

^^ 1 )=9Prox TAJ (sW)[CW] 
where = X w + rA*{y - A{X W )) 

and C^=^+tA*(S-A(^)). 

Using the Jacobian trace formula of the divergence, it can be easily seen that 

k 

div M (y) = E s (dn(y)[6], S) a ±£(0 Si) (4) 

i=\ 

where 8 ~ N (0, Idp) and Si are k realizations of S. The SURE(y) can in turn be iteratively estimated by plugging 
dti{y)[8 i ]=A{dXM{y)[8 i })m (4). 
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3. Local behavior of spectral functions 

This section studies the local behavior of real- and matrix- valued spectral functions. We write the (full) singular 
value decomposition (SVD) of a matrix X G W llXn2 

X = V x dmg(A x )U x 

(which might not be in general unique), where A x G K n is the vector of singular values of X with n = 111111(71!, n 2 ), 
diag(Ajf) G R™ 1><ri2 denotes the rectangular matrix with entries Ax on its main diagonal and otherwise, and 
V x G R" lXni and U x G W l2Xn2 are the unitary matrices of left and right singular vectors. 

3.1. Scalar- valued Spectral Functions 

A real-valued spectral function J can by definition be written as 

J(X) = <p(A x ) (5) 

where tp : R n — > M. is a symmetric function of its argument, meaning ip(PA) = 95(A) for any permutation matrix 
P G R nx " and A in the domain of ip. We extend ip to the negative half-line as ip(A) = <p(\A\). 

We then consider J a scalar- valued spectral function as defined in (5). From subdiffercntial calculus on spectral 
functions Lewis (1995), we get the following. 

Proposition 1. A spectral function J(X) = ip(A x ) is convex if and only if ip is convex, and then 

V 7 > 0, Prox 7 , 7 (A) = V x diag(Prox w (A x ));7x. 

3.2. Matrix-valued Spectral Functions 

We now turn to matrix-valued spectral functions 

F(X) = V x diag($(A x ))f/ x , (6) 

where $ : M. n — > M. n is symmetric in its arguments, meaning $oP = Po$ for any permutation matrix P G M ,ixn . 
We extend $ to negative numbers as $(A) = sign(A) $(|A|) and is the entry-wise matrix multiplication. 
One can observe that for F(X) = Prox 7 ,/(A) with $ = Prox 7V , the proximity operator of a convex scalar-valued 
spectral function is a matrix-valued spectral function. 

The following theorem provides a closed- form expression of the derivative of F when X is square, i.e. n\ = n<i = n, 
with distinct singular values. 

Theorem 1. For any matrix-valued spectral function F in (6), let the quantity 

VS g R" lX " 2 , D(X)[S) = V X (M(A X )[S] + T S (A X ) P S (S) + T A (A X ) V A (S))U* X 

where 5 = V X 5U X G K™ lX ™ 2 , the symmetric and anti- symmetric parts are defined, for 1 < i < n\ and 1 < j < ni, 
as 

and V A {Y\ 
and M{A) : I" 1 ™ 2 h-> W llXn2 is 

M(A) = diago5$(A) odiag. 



y ',j+ y j,i ^ i < n and j < n 
otherwise 

Yl ' 1 ^ Yl ' 1 if i < n and j < n 
^f- otherwise 
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The matrices r<?(A) and r^(A) are defined, for all 1 ^ i $J n\ and 1 ^ j < ?i2, as 

( if i = j 

r s (Ak,= * (A i;:* (A) ' A^A, 

[ 9$(A)j^ - 9$(A)jj otherwise, 

( if i = j 

r A(Ak,= ^ ( tt!! A)j A,>0 r A,>0 

[ 9$(A)i,j - 9$(A)ij otherwise. 

where for i > n we have extended A and $(A) as Aj = and $(A)j = 0. 

Assume that X is a square matrix, i.e. n\ = n-2 = n, and with distinct singular values, such that Aj ^ Aj /or all 
i ^ j. Then, a matrix-valued spectral function F is differentiable at X if and only if $ is differentiate at Ax- 
Moreover, 

VSe R nxn , dF(X)[8] = D(X)[6] 



The proof is given in Appendix A. 

Theorem 1 generalizes the result of Lewis & Sendov (2001) to square matrices that are not necessarily symmetric, 
and we recover their formula when X and 6 are symmetric matrices and X has distinct singular values. Regularity 
properties and expression of the directional derivative of symmetric matrix-valued separable spectral functions 
(possibly non-smooth) over non- necessarily symmetric matrices were also derived in Sun & Sun (2003). Before 
revising the previous version of this manuscript, Candes et al. (2012) brought to our attention their recent work 
on the SURE framework for parameter selection in dcnoising low-rank matrix data. Towards this goal, they 
provided closed-form expressions for the directional derivative and divergence of matrix- valued spectral functions 
over rectangular matrices with distinct singular values. They also addressed the case of complex- valued matrices. 

Although our proof of Theorem 1 is rigorously valid only for square matrices with distinct singular values, we 
conjecture that the formula of the directional derivative holds for rectangular matrices with repeated singular 
values. For the symmetric case with repeated eigenvalues, this assertion was formally proved in Lewis & Sendov 
(2001). As stated above, the full-rank rectangular case with distinct singular values was proved in Candes et al. 
(2012), where it was also shown that the divergence formula has a continuous extension to all matrices. 



4. Numerical applications 
4.1. Nuclear norm regularization 

We here consider the problem of recovering a low-rank matrix Xo S M" lXI12 . To this end, J is taken as the 
nuclear norm (a.k.a., trace or Schattcn 1-norm) which is in some sense the tightest convex relaxation to the 
NP-hard rank minimization problem (Candes & Recht, 2009). The nuclear norm is defined by 

J(X) = \X\.±[A X \ 1 . (7) 

Taking </(•) as || • ||* and <p as ||.||i in Proposition 1 gives: 
Corollary 1. The proximal operator 0/7I • ||» is 

V 7 >0, Pro X7 |.||.(X) = V x di ag (T 7 (Ax))C^ . (8) 
where T 7 = Prox 7 || ^ is the component-wise soft-thresholding, defined for i = 1, . . . , n as 

T 7 (t) 1 =max(0,l-7/||t i ||)t i . 



We now turn to the derivative of F = Prox^.^. A straightforward attempt is to take $ = Prox 7 || ^ = T 7 and 
apply Theorem 1 with 

0*(X mi = 8T ymi = {l if o JNjj (9) 
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Figure 1. Predicted risk and its SURE estimate^. 



However, strictly speaking, Theorem 1 does not apply since a proximity mapping is 1-Lipschitz in general, 
hence not necessarily diffcrentiablc everywhere. Thus, its derivative may be set-valued, as is the case for soft- 
thresholding at ±7. 

A direct consequence of Corollary 1 is that J is a simple function allowing for the use of the FB algorithm. 
Moreover, the expression of the derivative (9) provides an estimation of the SURE as explained in Section 2. 

4.2. Application to matrix completion 

We now exemplify the proposed SURE computation approach on a matrix completion problem encountered 
in recommendation systems such as the popular Netflix problem. We therefore consider the forward model 
y = A(Xq) + w G R p , W ~ A/"(0, cr 2 Idp), where Xq is a dense but low-rank (or approximately so) matrix and A 
is binary masking operator. 

We have taken (711,712) = (1000, 100) and P = 25000 observed entries (i.e., 25%). The underlying dense matrix 
Xq has been chosen to be approximately low-rank with a rapidly decaying spectrum Ax = {k~ }fe = i- The 
standard deviation a has been set such that the resulting minimum least-square estimate has a relative error 
\X-ls — ^o||f/||^o||f = 0.9. Figure 1 depicts the prediction risk and its SURE estimate as a function of A. For 
each value of A in the tested range, SURE(y) in (2) has been computed for a single realization of y with k — 4 
realizations Si in (4) 1 . At the optimal A value, X(y) has a rank of 55 with a relative error of 0.46 (i.e., a gain of 
about a factor 2 w.r.t. the least-square estimator). 

5. Conclusion 

The core theoretical contribution of this paper is the derivative of square matrix-valued spectral functions. This 
was a key step to compute the derivative of the proximal operator associated to the nuclear norm, and finally 
to use the SURE to recursively estimate the quadratic prediction risk of matrix recovery problems involving 
the nuclear norm regularization. The SURE was also used to automatically select the optimal regularization 
parameter. 

A. Summary of the proof of Theorem 1 

The following lemma derives the expression of the derivative of the SVD mapping X h-> (Vx, A-x,Ux)- Note 
that this mapping is not well defined because even if the Ax are distinct, one can apply arbitrary sign changes 
and permutations to the set of singular vectors. The lemma should thus be interpreted in the sense that one can 
locally write a Taylor expansion using the given differential for any particular choice of SVD. We point out that 




1 Without impacting the optimal choice of A, the two curves have been vertically shifted for visualization. 
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a proof of this lemma using wedge products can be found in (Edelman, 2005), but for the sake of completeness, 
we provide our own proof here. 

Lemma 1. We consider Xq G R nx ™ with distinct singular values. For any matrix X in a neighborhood of Xq, 
we can define without ambiguity the SVD mapping X t— > (Vx, Ax, Ux) by sorting the values in Ax and imposing 
sign constraints on Ux- The singular value mapping X i— > (Vx,Ax,Ux) is C 1 and for a given matrix 5, its 
directional derivative is 

dA x [S] = di&g(VxSU x ) , 
dV x [S} = V x iv, 
dU x [5]=U X iu 

where iy & M. nxn and lu G R nxn are defined, for all 1 ^ i < n and 1 ^ j < n, as 

M,j = 2 2 and (lu)^ = . (10) 

(Axjj - {Ax) l y A x) 3 - {Ax)i 

and where 6 = V£5U X £ K" x ". 

Proof. Let S n be the sub-space of Hermitian matrix in R nx ™. Let tp : R nxn x y — > R nx ™ x S„ x S n , where 

y = R nxn x R nxn x R n ^ be deflned for y = (T/ f7, A) G J 7 as 

ip(X,Y) = ( X -Vdiag(A)U*, V*V-Id, U*U -Id). 
We have for any vector £x e M nx ™ 

ai^(x,F)[Cx] = ( Cx, o, o ) (ii) 

and for any vector ( Y — (Cu, (v,(a) £ y 

d 2 TP(X,Y)[( Y } = ( -Cydiag(A)C7* -ydiag(A)C^-ydiag(C A )C/*, V*( V +( V V, U*(u + CIjU). 

Let Xq have distinct singular values and any of its SVD Yo = (Uq,Vq, Aq). We have ip(Xo,Yo) = 0. Moreover, 
denoting l v = V *( v e R nxn and i v = U^u € R nxn , for any z = {z x ,z 2 ,z z ) e M nx ™ x 5„ x £„, solving 
c?2V'(A^0j ^o)[Cr"] = 2 is equivalent to solving 

l v diag(Ao) + diag(A )tp + diag(C A ) = -V * Zl U = -Z\ (12) 

where ty + L v = z 2 and Lu + i v = Z3. Considering z 2 = and 2:3 = shows that and ij/ are antisymmetric. 
In particular they are zero along the diagonal. Thus applying the operator diag to both sides of (12) shows 
(a = — diag(Vg*zi£/o). Now considering the entries and of the linear system (12) shows that for any 
1 ^ i ^ n and 1 ^ i < n 



(A ) j -(A )A /(tv)< A _ (-(zi)u 
-(Ao), (Ao) 3 . J vMW 



(13) 



Since for i ^ j, (Ao)j ^ (Ao)j, these 2x2 symmetric linear systems can be solved. Then d 2 ip(Xo, Yq) is invertible 
on K" xn x 0„ x 0„ and for z = (zi,0, 0), its inverse is 

(d 2 i>(X ,Y ))- 1 [z] = (Voiv, EW, -dmg(V * Zl Uo) ) (14) 

where iy and i\j are given by the solutions of the above series of 2 x 2 symmetric linear systems. 

Since lm(diip(X,Y(X))) C R raxn x 0„ x 0„, we can apply the implicit function theorem (Rockafellar & Wets, 
2005). Hence, for any X S R" x " in the neighborhood of Xq, there exists a function Y(X) = (Ux,Vx,Ax) such 
that i/j(X, Y(X)) = 0, i.e. X admits an SVD. Moreover, this function is C 1 in the neighborhood of Xq and its 
differential is 

dY(X) = -WiXtYiX))- 1 od^{X,Y{X)). 

Injecting (11) and (14) gives the desired formula by solving (13) in closed form. Since Xq is any matrix with 
distinct singular values, we can conclude. □ 
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We now turn to the proof of the theorem. 

Proof. Since the singular values of X are all distinct, by composition of diffcrcntiablc functions, we can derive 
the relationship (6) that defines F which gives 

V£dF{X)[8\Ux = ^diag($(Ajr)) + diag($(A x )K + M(A X )[S] 

where we have used the notation introduced in Lemma 1. Using the expression (10) for ljj and tv shows that 
the matrix W = tv diag(4>(Ax)) + diag($(Ajf is computed as 

Wi,j = A2 = A2 (<Pj(A.j6i,j + MSj,i) - ipi(Ai5ij + AjS jti )) 

where if = $(A). Rearranging this expression using the symmetric and anti-symmetric parts shows the desired 
formula. □ 
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